
Curve Parameter Space

September 27, 2013

One of the more common misconceptions people have regarding computational geometry has to do with curve parameters. Since the mathematics of Nurbs curves is rather involved —much of it is certainly beyond high-school level— it is quite difficult to explain how control-point coordinates, control-point weights, curve degrees and knot-vectors all conspire to complicate the parameterization of a Nurbs curve. My own grip on Nurbs maths is slippery at best so I think it's better for everyone involved if I start this discussion using much simpler types of curves.

But before we can start talking about the problem, we should first lay down some ground rules for what sort of shapes qualify as ‘curves’:

  • A curve is a finite, one-dimensional object, existing in some space with an arbitrary number of dimensions. In this blog post I’ll only concern myself with two-dimensional and three-dimensional spaces as we are all familiar with them.
  • Every curve must have exactly two endpoints. No more, no less.
  • If the endpoints coincide, the curve is considered to be closed.
  • There may be no gaps on the interior of a curve, as that would result in more than two endpoints.
  • There may be no branching points on the curve, except where an endpoint is coincident with some interior point of the curve.

Basically, think of all possible curves as liquorice laces: you can stretch, bend, twist and kink them, but ultimately it's nothing more than a bunch of deformations applied to a straight piece of chewy goop. Note that the definition above differs from the common mathematical definition of curves. Mathematicians tend to be far more inclusive and rigorous.

Read the rest of this entry »

Another Galapagos Tutorial

September 17, 2013

I’ve been thinking about Grasshopper 2.0 for a while now and there are a lot of facets to the project where I’d like to see improvement. One of the things I’m looking into is incorporating 3rd party code libraries that would provide advanced statistical and algorithmic functionality. One of the libraries I’ve been browsing is the Microsoft Solver Foundation whose main purpose is to solve optimization problems.

Looking for examples on how to use MSF I came across this blog post which poses a hypothetical (but not unrealistic) problem which can be cracked using an algorithmic solver. The post itself has some inconsistencies in the numbers it uses to describe the problem, so I’ll outline it briefly here as well:

  • You own a trading company and you’re looking to import some cheap-ass furniture from China and sell it to gullible Europeans for cut-throat prices. So the goal here is profit maximization.
  • We hire individual containers from a shipping company, and these containers have limits on how much weight you can put in. We’ll use a 500 kilogram limit in this tutorial. We are not concerned with volume limitations.
  • There are three types of furniture we can import: chairs, tables and bookcases (the original article uses cupboards instead of bookcases, but I wanted three items that start with different letters). Each has its own specific purchasing cost, profit margin and weight.
  • There is one additional constraint concerning purchasing cost: your company can only afford to spend 5000 Euros per container.

That's it, simple enough. A profit function with three variables (the number of chairs, tables and bookcases respectively) and two hard constraints limiting the weight and purchasing costs.

Because of the weight and purchasing constraints this problem becomes non-trivial and an algorithmic approach would be good to have, if only to check your intuitions in case you’re doing the purchasing by hand.

This is a three-variable problem which means the phase space of all possible states is three-dimensional and the fitness landscape is a four-dimensional surface (if you don't know what a fitness function/landscape is, please see this blog post or buy the Architectural Design publication with my article). I cannot draw a four-dimensional surface on a two-dimensional screen so instead I decided to just make sh*t up and hope it gets the point across regardless of being utter humbug:

Fitness Function with Constraints

Here you see a two-dimensional phase-space (let’s ignore bookcases for the time being and assume we only have to pick chairs and tables) with the resulting three-dimensional fitness landscape. From left to right we go from zero chairs to some arbitrary upper limit. From front to back we go from zero tables to some arbitrary upper limit. Every two-dimensional point is some combination (i.e. state) of chairs and tables. For every combination we can compute the profit that combination would generate, and this has been represented by the height of the landscape. The higher the surface, the more money we’re going to make. Find the highest point on the landscape and you know how many chairs and tables hit the sweet spot.

So far so good. But what about our constraints? Some combinations of chairs and tables are impossible either because they weigh too much or because they cost too much. So our landscape shouldn’t in fact exist at all when this is the case. This is known as a hard constraint and unfortunately in Galapagos it is impossible to add a hard constraint (at least for now).

Luckily we can mimic this effect with penalty clauses. We're looking to maximize our profit, so what we'll do is subtract a large amount of money if any of these constraints are overstepped. As long as we subtract enough, the solver will automatically avoid those areas. Enough theorizing; let's hook up some wires!
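Before we do, here is roughly what such a penalised fitness function might look like if you were to type it into a C# script component instead of wiring it up with sliders and components. All the per-item cost, profit and weight figures below are invented purely for illustration —only the 500 kilogram and 5000 Euro limits come from the problem description— so treat this as a sketch of the penalty idea, not as the actual solution:

double Fitness(int chairs, int tables, int bookcases)
{
  // Hypothetical per-item figures, made up for the sake of the example.
  double profit = 20 * chairs + 50 * tables + 80 * bookcases;
  double cost   = 40 * chairs + 120 * tables + 200 * bookcases;
  double weight = 5 * chairs + 30 * tables + 60 * bookcases;

  // Penalty clauses: subtract a large amount whenever a hard constraint is
  // overstepped, scaled by the size of the violation so the solver is pushed
  // back towards the feasible region rather than left on a flat plateau.
  if (weight > 500)
    profit -= 10000 + 100 * (weight - 500);
  if (cost > 5000)
    profit -= 10000 + 100 * (cost - 5000);

  return profit; // the value Galapagos would be asked to maximize
}

The exact size of the penalty doesn't matter much, as long as overstepping a constraint always costs more than any profit it could possibly buy.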

Read the rest of this entry »

This is going to be a difficult post to write on several levels. I’m about to lay heavy blame on a number of people and organisations who are part of my loyal customer base. I am also going to incriminate the industry I work for, which includes my direct employer and —of course— myself.

At the same time I have to marshal my arguments very carefully so as not to belittle a specific individual whose work I am about to dissect and critique in a harsh tone. I firmly believe —although he probably doesn’t agree with me— he is the victim of the twin evils of underinformed educators and overeager, heedless software companies. I wish neither to patronise nor scapegoat him and I know that cushioning language is not my strong suit. Please remember this while reading the text below. This is not about ridicule, but rather about a pervasive and alarming world-wide trend in academic architecture.

I’d like to thank ███████████ for sending me his work and having the courage to allow me to publicly critique it, knowing I’d focus only on the bad and not the good.

Also my sincere thanks goes out to K. who —unlike me— did receive proper academic training during her time at University (Turku University, Department of English) and was able to comment comprehensively on the scholarly and academic aspects of this critique.

Read the rest of this entry »

Sad cat is sad

May 27, 2013

A few days ago our #1 pet started showing signs of depression. It came on quite suddenly: sleeping a lot, loss of appetite, no purring, no grooming, no marking. There seem to be several common causes for cat depression, including a new pet (check), moving to a new place (well, yes, but we haven't got there yet), changes to the living environment (check, the place has been a mess for weeks since we're packing) and abandonment (check, with an additional cat in the house and a lot of work to do he's been getting less attention).

We were getting quite worried about this as there are a lot of other stress factors in his near future and a cat that’s not eating is not long for this world. Today however he seems to be getting better at a rather spiffy rate (asking for food, jumping on window sills, playing with cat #2). From what we know about depression it’s unlikely to start and stop suddenly, unless it’s part of a bi-polar disorder.

But I haven't watched the first 5 seasons of House* without learning anything and I think we may have found the actual problem. A few days ago we took the cat out into the car so he could get used to the space and the smell. Like the little furry chicken he is, he immediately started struggling once it became obvious I wanted him inside the car. However once inside I let him go so he could explore in his own time. His plan of escape was well planned and expertly executed: one massive leap through the open window onto the safe green grass beyond. The unexpected complication was that the window in fact turned out not to be open after all.

It was quite a massive, head-first collision and given the rate at which symptoms occurred and then dissipated a concussion seems more likely. Unfortunately we didn’t think of it early enough to test for anisocoria.

 

* and yes, season 4 was lame and season 5 was terrible and I stopped watching after that.

I haven't been posting much lately, the reason being that our life has taken a pretty significant turn. Since Christmas last year we've been looking for a new place to live. We finally found a nice little house in a quiet part of Tirol that'll suit the three of us just fine. We have no idea yet how the cat will react to the local wildlife (should be mostly deer, foxes and hares), hopefully nothing untoward will happen.

House ourHouse = new House();

None of us is a party animal and we’d much prefer to live amongst the trees than amongst the pubs. The place is somewhat on the small side for two people who both work from home, but we’ll make do.

One problem with living far away from centres of commerce is that we will need a car to get around. I don’t have a license yet but luckily K can serve as my chauffeur until I do. The house pretty much tapped all our savings but I think (read: hope) we’ve found something which will last us for the upcoming 5 years. It’s an early C220 CDI kombi model (2001), imported into Slovakia via Italy and owned by a succession of Mercedes dealership employees. We’ll be the fifth —and probably last— owners this car will ever have.

ICar hers = Mercedes.Create(type.220CDI | type.Kombi);

The picture above is from our first prolonged test-drive in the local mountains here. We had to pull over as apparently there was a bike race going on. Still, not a bad place to have a break. That's Hotel Panorama on the ridge above us.

And lastly, we've also got a second cat. This wasn't planned and we're still hoping to find a new home for her before we drive to Austria next week. She showed up about a month ago on our window sill. Emaciated and scruffy. After a month or so with us she's perked up a bit, though she is still very kittenish. Extremely social and hug-prone, but also a bit naughty so she needs someone with a firm hand who won't shy away from smacking her over the head when she steals food.

Animal pet = furBall as Cat;     //photo by K

After we’ve settled in I think I’ll be happy not having to make any major decision again for at least 3 years.

What’s the point? [6]

March 11, 2013

Fine, I lied. Post [5] wasn't the last one after all. I'd like to propose two more optimizations to the Quadtree/Octree algorithm as explained in the previous post.

As you may recall, spatial trees are clever because they allow us to quickly —almost trivially— reject large portions of a dataset. The recursive nature of the spatial tree means that if we can reject a specific node, we can also reject all of its child-nodes. Furthermore, because the tree has variable density it is very well suited to handle unbalanced point collections.

We can reject a node in the spatial tree when the shortest distance from the search locus to the boundary of that node is already larger than the best answer we've found so far. From this it follows that the smaller the node region, the more quickly we can reject a node. Typically the combined regions of all child-nodes equal the region of the parent-node. However it's only important to have a tree without gaps between nodes if we're still in the process of inserting points into the tree. When we're done adding points we can perform two fairly easy optimizations that will improve the search speed (though not enough to change the runtime-complexity).

Optimization number one involves shrinking every node region to contain exactly the points inside of it. Or —if the node contains child-nodes rather than points— shrinking the node region to contain exactly the regions of the child-nodes. By shrinking the node regions to the bare minimum we will increase the shortest distance from the search locus to the node region, which will —sometimes— result in earlier rejection.

Optimization number two involves simplifying the tree structure by removing nodes that contain exactly one child-node. We can 'detach' the single child and plug it into the same slot as its parent used to occupy. By removing these pointless nodes we reduce the time spent traversing the tree.
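Both optimizations can be applied in a single recursive pass once the tree is complete. The sketch below assumes a quadtree node that exposes a Region rectangle, a Points list and a Children array, plus a couple of bounding-box helpers; all of these names are mine, not taken from any actual library:

using System.Collections.Generic;

static class TreeCleanup
{
  public static void Optimize(QuadNode node)
  {
    if (node.Children == null)
    {
      // Optimization one (leaf): shrink the region to the bounding box of its points.
      node.Region = Rect.AroundPoints(node.Points);
      return;
    }

    // Drop empty child-nodes entirely and optimize the survivors.
    var keep = new List<QuadNode>();
    foreach (QuadNode child in node.Children)
    {
      bool empty = child.Children == null && child.Points.Count == 0;
      if (empty) continue;
      Optimize(child);
      keep.Add(child);
    }
    node.Children = keep.ToArray();

    if (node.Children.Length == 1)
    {
      // Optimization two: detach the single child and plug it into this node's slot.
      QuadNode only = node.Children[0];
      node.Points = only.Points;
      node.Children = only.Children;
      node.Region = only.Region;
    }
    else if (node.Children.Length > 1)
    {
      // Optimization one (internal node): shrink the region around the already
      // shrunk child regions.
      node.Region = Rect.AroundRegions(node.Children);
    }
  }
}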

QuadTreeOptimized

The image above shows the same Quadtree we built in the previous post, but now the nodes are colour-coded to reflect what sort of optimization we can apply to them. Orange nodes are empty and can be removed entirely. Green nodes contain points whose collective bounding-box is smaller than the node region. By removing empty nodes and shrinking the remainder, we end up with a vastly reduced total structure.

Once you apply these optimizations to a tree structure it is no longer possible to insert more points, as they might fall in the gaps that now exist between adjacent nodes. Unless you’re willing to make your point insertion smart enough to grow node regions on demand of course.

Seriously, I’ll stop harping on about spatial trees now.

What’s the point? [5]

March 9, 2013

Welcome to the —for now— last post in this series on algorithmic optimization. We talked a bit about generic optimization, then we introduced a specific problem and explored different ways of speeding up the search for a solution to that specific problem. We talked about brute-force searching and why it's not an option when we're dealing with large point collections. We discussed one-dimensional sorting and spatial bucketing and the merits and drawbacks of each algorithm. It is now time to discuss the ultimate in closest-point-search algorithms, the weapon of choice for the informed and discriminating programmer: spatial trees.

But let us start by briefly rehashing the logic of the square-grid-search algorithm and why it's not a particularly good one. By creating a grid of adjacent cells and assigning each point p in a collection C to whatever cell contains it, we gain the ability to explore C in a localized fashion. Although we may not know exactly which two points are direct neighbours, we do know which two cells are direct neighbours and therefore all the points in these cells can be treated as being in the same neighbourhood. This works splendidly as long as single cells do not contain too many points and there aren't too many completely empty cells. For balanced point collections we just need to pick a clever grid size, but for unbalanced collections we are well and truly screwed:

Large cells vs. small cells in unbalanced point collections.

If we pick a large grid cell size then too many points end up in the same bucket and we’ve converted a smart search algorithm into a brute-force algorithm. If we pick a small grid cell size then there will be loads of empty cells and iterating over them will retard the search process.

If at this point you’re thinking “If brute-force search is a prohibiting factor in the case of grid cells with many points, why not put a smaller grid inside those cells?” you can feel real good about yourself. In fact, once we establish that it’s allowed to put a smaller grid inside the cell of a bigger grid, we can simplify the whole grid notion and only support grids that are composed of 2×2 cells. To create a nested grid structure like this, we only need a very short algorithm:

  1. Draw a rectangle around all points in C.
  2. If the number of points in this rectangle exceeds N, then subdivide the rectangle into four smaller rectangles.
  3. Assign each of the points in the parent rectangle to the appropriate child rectangle, and then repeat steps 2 & 3 for each child rectangle.

The above algorithm is recursive and the grid structure it generates is no longer one where all cells have the same size. If points are more densely packed in one region of space, the grid-structure will subdivide more often and it will result in smaller cells. If points are completely absent from another region of space the grid-structure will not subdivide at all and we won’t be plagued by a plethora of empty grid cells. If we apply this logic to an unbalanced point collection, we should get something like this:

QuadTreeStructure

Level 0 represents the top-most level of the grid. The number of points inside the outer boundary is too high and the region is split into four quadrants. The lower right quadrant still has too many points so it too is split into four quadrants. Then the process repeats one more time, giving us a third level of subdivision. Now we have a grid of ten cells of varying size that together contain the entire point collection. In fact two of the cells contain no points and can be removed from the grid structure in order to reduce memory overhead. The entire structure also contains three cells which host nested grids.

A spatial data structure like this is typically referred to as a tree in programming because the recursive and fractal nature of the algorithm resembles the branching patterns of biological trees. There are many different flavours of spatial trees, some more suited for specific problems, some easier to implement, some optimized for point insertion or deletion. What I've drawn above is usually called a Quadtree (or Q-tree) because it always subdivides into quadrants. If we extend this notion to 3D space we have to subdivide a single box volume into eight sub-volumes, and we call the resulting structure an Octree instead.
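To make the three-step recipe above a little more concrete, here is a minimal C# sketch of a Quadtree node with a recursive insertion routine. It glosses over plenty of details (many coincident points, for example, would subdivide forever) and the type names and the capacity N are my own choices rather than anything taken from an existing library:

using System.Collections.Generic;

struct Point2d { public double X, Y; }

struct Rect
{
  public double X0, Y0, X1, Y1;   // lower-left and upper-right corners

  public bool Contains(Point2d p)
  {
    return p.X >= X0 && p.X <= X1 && p.Y >= Y0 && p.Y <= Y1;
  }

  public Rect[] SplitIntoQuadrants()
  {
    double mx = 0.5 * (X0 + X1);
    double my = 0.5 * (Y0 + Y1);
    return new Rect[]
    {
      new Rect { X0 = X0, Y0 = Y0, X1 = mx, Y1 = my },
      new Rect { X0 = mx, Y0 = Y0, X1 = X1, Y1 = my },
      new Rect { X0 = X0, Y0 = my, X1 = mx, Y1 = Y1 },
      new Rect { X0 = mx, Y0 = my, X1 = X1, Y1 = Y1 }
    };
  }
}

class QuadNode
{
  public const int Capacity = 8;   // the N from step 2; pick whatever suits your data

  public Rect Region;
  public List<Point2d> Points = new List<Point2d>();
  public QuadNode[] Children;      // null until this node subdivides

  public void Insert(Point2d p)
  {
    if (Children != null)
    {
      // Step 3: hand the point to whichever child rectangle contains it.
      // (Points outside the region are silently ignored in this sketch.)
      foreach (QuadNode child in Children)
      {
        if (child.Region.Contains(p)) { child.Insert(p); return; }
      }
      return;
    }

    Points.Add(p);
    if (Points.Count > Capacity)
    {
      // Step 2: too many points in this rectangle, split it into four quadrants...
      Rect[] quads = Region.SplitIntoQuadrants();
      Children = new QuadNode[4];
      for (int i = 0; i < 4; i++)
        Children[i] = new QuadNode { Region = quads[i] };

      // ...and redistribute the points among the new children.
      foreach (Point2d q in Points)
        Insert(q);
      Points.Clear();
    }
  }
}

Building the tree is then just a matter of computing the bounding rectangle of C (step 1), creating a root node with that region and calling Insert once for every point.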

So how does searching in a Quadtree work? Remember that in order to perform a fast closest point search, we need to reject as many points as possible, limiting the search to only those points which are serious candidates. In some ways searching a Quadtree is easier than searching a fixed grid, one of the reasons being that we don’t actually have any empty cells. The search algorithm might work roughly like this:

  1. Build a tree T from the point collection C.
  2. Find the cell L in T which is closest to the search point h.
  3. Find the point p in L which is closest to h using —once again— brute-force search.
  4. Start recursing to sibling and parent cells to see if they contain points closer still.
  5. Every cell that has a minimum distance to h which is already larger than the smallest distance we found so far can instantly be skipped, as can all child-cells within it.

Step 5 is where the real magic happens. Spatial trees make it very easy to weed out enormous portions of C that are not near h. Adding points to an existing tree structure is also very easy, even if they are outside the outer bounds of the level 0 cell. All we need to do in those cases is wrap the existing tree in a new, larger root cell —the old level 0 then by definition becomes level 1— in order to grow the reach of the tree. Every time we add such a level we double the reach, so even inserting points which are really far away will not take intolerably long (unlike fixed grids, which grow in a linear rather than exponential fashion).
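Here is roughly what that search might look like, reusing the little QuadNode and Rect types from the sketch above. For brevity it skips steps 2 and 3 and simply recurses from the root, relying entirely on the rejection test of step 5; descending to the nearest leaf first would merely hand us a good early bestDistance so that the rejection test starts biting sooner:

using System;

static class QuadSearch
{
  static double Distance(Point2d a, Point2d b)
  {
    double dx = a.X - b.X;
    double dy = a.Y - b.Y;
    return Math.Sqrt(dx * dx + dy * dy);
  }

  // Shortest distance from h to an axis-aligned rectangle (zero if h is inside it).
  static double DistanceToRect(Rect r, Point2d h)
  {
    double dx = Math.Max(Math.Max(r.X0 - h.X, 0.0), h.X - r.X1);
    double dy = Math.Max(Math.Max(r.Y0 - h.Y, 0.0), h.Y - r.Y1);
    return Math.Sqrt(dx * dx + dy * dy);
  }

  public static void Search(QuadNode node, Point2d h, ref Point2d best, ref double bestDistance)
  {
    // Step 5: if even the nearest edge of this cell is further away than the best
    // answer so far, neither this cell nor any of its children can improve on it.
    if (DistanceToRect(node.Region, h) >= bestDistance)
      return;

    if (node.Children != null)
    {
      foreach (QuadNode child in node.Children)
        Search(child, h, ref best, ref bestDistance);
    }
    else
    {
      // Brute-force over the handful of points inside this leaf cell.
      foreach (Point2d p in node.Points)
      {
        double d = Distance(p, h);
        if (d < bestDistance) { bestDistance = d; best = p; }
      }
    }
  }
}

Kick the search off with bestDistance set to double.MaxValue and the nearest point and its distance come back through the ref parameters.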

As I mentioned in post #1, optimization is often a difficult process and code which is optimized often gains so much complexity as to become unreadable —at least without significant effort— to third parties. Only start optimizing code once you've determined beyond a shadow of a doubt where the bottleneck in your program is.

What’s the point? [4]

March 9, 2013

One of the main drawbacks of the adapted-sorted-brute-force search algorithm we developed in the previous post is that it only sorts the data along a single coordinate axis. For two dimensional data that means we’re ignoring half the information. For three dimensional data we’re ignoring twice as much information as we’re using to speed up the search. The more dimensions, the less efficient that searcher becomes.

The key to optimizing a Closest Point Search is to reject as many points as possible because measuring the distance between points takes time. It might not take a lot of time to compute the distance between two points, but if you want to find the nearest point in a collection of 10 million points for 1 million other points, you end up computing 10⁷ × 10⁶ = 10¹³ distances if you take the brute-force approach. Even if a single distance test only takes 5 nanoseconds it will still take 50,000 seconds, which is almost 14 hours. The adapted-sorted-brute-force (which wasn't all that brute-force by the time we were done with it) managed to reject points because their distance measured along the sorting direction exceeded the best result we've found so far.

What we'd like to accomplish today is roughly the same, except we want to be able to reject points in all directions, not just one. Instead of sorting all the points, we'll put them in spatial buckets, and these buckets in turn will be defined in such a fashion as to allow very easy navigation to neighbouring buckets. In other words: a grid.

SpacialBucketing

In the image above you can see the entire point collection represented by the black dots. Point h is represented by the red cross and we're looking for the black dot closest to h. This search algorithm also requires a pre-process step, but instead of sorting all the points we'll assign them to whatever grid square contains them. So instead of a single large collection of points (C), we end up with a grid structure where each grid cell contains a subset of C. This is a very cheap process as it only takes a single subtraction, multiplication and rounding operation per dimension per point*. It's also very easy to multithread this pre-process as it belongs to a class of problems known as embarrassingly parallel. The algorithm —which we shall henceforth refer to as the square-grid-search algorithm— would work as follows:

  1. Create a grid of square cells that encompasses all of C.
  2. Assign all points in C to their respective grid cells.
  3. Find which cell r contains h (or which is closest to h if h is not within the grid bounds).
  4. Find the nearest point p in r to h using brute-force search.
  5. If r is empty, extend the search to all eight neighbouring cells and so on until at least one point is encountered.
  6. Find the collection of grid cells R which intersect with the search boundary (represented by the green circle).
  7. Iterate over all points in R using brute-force to see if any of them are closer than the early result.
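A minimal sketch of steps 1 and 2 —the grid itself plus the cheap cell-index computation— might look like the snippet below. The names are mine and Point2d is nothing more than a throw-away container for two coordinates:

using System;
using System.Collections.Generic;

struct Point2d { public double X, Y; }

class SquareGrid
{
  public double MinX, MinY;          // lower-left corner of the grid
  public double CellSize;            // all cells are squares of this size
  public int CountX, CountY;         // number of cells in each direction
  public List<Point2d>[,] Cells;

  public SquareGrid(double minX, double minY, double cellSize, int countX, int countY)
  {
    MinX = minX; MinY = minY; CellSize = cellSize;
    CountX = countX; CountY = countY;
    Cells = new List<Point2d>[countX, countY];
  }

  // Step 2: assigning a point to its cell takes one subtraction, one division
  // (or a multiplication by the reciprocal cell size) and one rounding operation
  // per dimension.
  public void Insert(Point2d p)
  {
    int i = (int)Math.Floor((p.X - MinX) / CellSize);
    int j = (int)Math.Floor((p.Y - MinY) / CellSize);

    // Clamp so that points on the outer boundary still land inside the grid.
    i = Math.Min(Math.Max(i, 0), CountX - 1);
    j = Math.Min(Math.Max(j, 0), CountY - 1);

    if (Cells[i, j] == null)
      Cells[i, j] = new List<Point2d>();
    Cells[i, j].Add(p);
  }
}

Steps 3 to 7 then boil down to brute-force searches over Cells[i, j] and, where the search boundary demands it, its neighbouring cells.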

The algorithm is fairly straightforward and easy to implement, but —like adapted-sorted-brute-force— it has serious drawbacks. It will perform very well on balanced point collections, but the performance heavily depends on the size of the grid cells. Make the cells too small and many of them will be empty, causing us to spend a lot of time iterating over pointless** areas. Make them too big and they'll contain a lot of points, which will slow down the algorithm as the intra-cell search is brute-force.

If the collection is unbalanced then we’ll end up with many empty cells and some cells which contain many points. As long as the intra-cell search remains brute-force, square-grid-search will not be suitable for such cases.

There is a new drawback as well. It’s quite simple and very fast to insert points into a sorted collection without breaking the sort order. It is also very easy to insert points into a spatial bucketing grid, provided the points are within the grid boundaries. If you wish to add points outside of the existing grid boundary, the grid will have to be grown, which can be a computationally expensive operation***.

Up next we’ll examine ways to make spatial bucketing more effective for unbalanced —indeed, wildly unbalanced— point collections.

* Assuming the grid only has rectangular cells. Square cells make the algorithm much easier to implement than rectangular ones, to say nothing of hexagonal or free-form cells.

** Bow down before the Master of Puns!

*** It is possible to use a data structure known as a sparse array to store the grid cells rather than a standard multi-dimensional array. This effectively solves the insertion problem, but it also makes navigation between neighbouring cells slightly slower.

What’s the point? [3]

March 9, 2013

Remember the sorted brute-force search algorithm we came up with last time? It attempted to outperform regular brute-force searching by sorting the points left to right and then skipping as many points as possible once it could be certain the best possible answer had already been found. We also mentioned that sometimes it makes more sense to iterate from right-to-left, or top-to-bottom or bottom-to-top depending on the relative location of the search locus (h) with respect to the point collection (C) as a whole.

But in many cases we have to work our way from one extreme of C to the point nearest to h before we can seriously start to omit points. If only we could start somewhere in the vicinity of h, the chances of finding the best possible answer quickly would be much higher, which in turn means we can skip more points, which in turn means we'll be done sooner. So an improved sorted brute-force search algorithm might look like this:

  1. Sort the points in C by ascending x-coordinate (i.e. from left to right).
  2. Starting from a point p in C somewhere in the vicinity of h, iterate towards the right and compute the distance to h.
  3. If this distance is less than the previous smallest distance, remember this new distance as the smallest distance and remember p as the best result so far.
  4. If the x-coordinate of p is higher than the x-coordinate of h + the current smallest distance then stop iterating.
  5. Starting from the point p in C that was in the vicinity of h, iterate towards the left and compute the distance to h.
  6. If this distance is less than the previous smallest distance, remember this new distance as the smallest distance and remember p as the best result so far.
  7. If the x-coordinate of p is lower than the x-coordinate of h – the current smallest distance then stop iterating.
  8. When we're done iterating, we'll have remembered the point nearest to h.

The only difference between this adapted search algorithm and the original is that instead of iterating from one end of C towards the other, we start somewhere in the middle, at a point that is already near h. But how do we quickly find a point near h? Wasn’t that what we were trying to accomplish in the first place?

Well, not quite. We're looking for the point closest to h. But for this adapted algorithm to work, all we need is a point somewhere in the vicinity of h. As long as it's not way off, we should be able to get some benefit out of our smarter iteration logic. So the trick now is to find this magical point in the vicinity of h (let's call it pbase) and of course we need to do so quickly because we're trying to save time, not spend it.

There is a more rigorous definition of pbase than "in the vicinity of" and it may give a clue as to how we're supposed to find it quickly. pbase can be defined as that point p in C whose x-coordinate is most similar to the x-coordinate of h. In other words, for this initial guess, we're happy looking only at the distribution of C in the x-direction, ignoring the fact that C is a two-dimensional point collection and thus also has a distribution in the y-direction. There is a fantastically clever and fast and easy way to find a value in a sorted collection that is close to some other value. It is called Binary Searching.

A binary search progresses by taking a series of decreasing steps towards the right answer. It only requires that the data it is stepping over is sorted. A major benefit of binary searching is that the runtime complexity is O(log N), which is a very expensive way of saying that as the collection gets larger, it doesn’t take up correspondingly more time. Imagine for example that we use binary search to find an item in a collection containing 100 items. It will take —on average— N milliseconds to complete. If we now apply the same algorithm to a collection ten times the size, i.e. 1000 values, it might only take 2×N milliseconds rather than 10×N. If we make the collection bigger still, say 10000 values, it will take only 4×N milliseconds rather than 100×N.

Put even simpler, binary search —and other algorithms with a logarithmic runtime complexity— take proportionally less time to complete big problems as opposed to small problems.

Let’s have a look at how a binary searcher might find pbase in our point collection. The image below shows the points in C as grey circles, sorted from left to right. Note that the y-coordinate of each point  —while drawn— is irrelevant. Below the points are drawn the four steps that a binary searcher performs in order to arrive at the conclusion that point #3 is closest to h (×) if you only take the x-distribution into account. By a happy coincidence, point #3 is also the closest point in general, but that’s not necessarily the case.

BinarySearch

A binary searcher always operates on the same basic principle, which is then recursively applied over and over until a solution is found. We know that h is somewhere in between the extremes of the entire collection. We can ascertain this by making sure that h is to the right of the first point (0) and to the left of the last point (19). So we take the point halfway in between 0 and 19 and test whether it is to the left or to the right of h. That point happens to be #10 and it happens to be to the right of h. So now we know that the real answer must be somewhere between 0~10.

Once again we test the point halfway between 0 and 10, which is point #5. Same deal as before, 5 is to the right of h and we’ve now shrunk our search range to 0~5. If we look halfway between 0 and 5, we find point #2 which happens to be to the left of h, so instead of adjusting the upper limit of the search range we now adjust the lower and end up with a range of 2~5. Repeat until you can move neither left nor right without making things worse.

As you can see, the search range is effectively cut in half with every step. So when you double the number of items, it only takes one extra step to complete the search. Quadruple the number of items and it only takes two additional steps. This is why binary search is a logarithmic algorithm.

We’ve now developed a closest point search algorithm which actually runs pretty fast, especially on balanced point collections. We’ve gone from brute-force, to sorted-brute-force, to adapted-sorted-brute-force which is hardly brute-force at all any more.
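Put together, a sketch of the adapted algorithm might look like the snippet below. It assumes the points have already been sorted by ascending x-coordinate, and the Point2d struct and Distance helper are just throw-away stand-ins rather than types from any particular library:

using System;

struct Point2d { public double X, Y; }

static class AdaptedSearch
{
  static double Distance(Point2d a, Point2d b)
  {
    double dx = a.X - b.X;
    double dy = a.Y - b.Y;
    return Math.Sqrt(dx * dx + dy * dy);
  }

  // Binary search for pbase: the index of the point whose x-coordinate is most
  // similar to x. The array must be sorted by ascending x-coordinate.
  static int FindBaseIndex(Point2d[] sorted, double x)
  {
    int lower = 0;
    int upper = sorted.Length - 1;
    while (upper - lower > 1)
    {
      int mid = (lower + upper) / 2;   // the point halfway the current range
      if (sorted[mid].X < x)
        lower = mid;                   // x lies to the right of mid
      else
        upper = mid;                   // x lies to the left of (or on) mid
    }
    return Math.Abs(sorted[lower].X - x) < Math.Abs(sorted[upper].X - x) ? lower : upper;
  }

  public static Point2d Search(Point2d[] sorted, Point2d h)
  {
    int baseIndex = FindBaseIndex(sorted, h.X);
    Point2d best = sorted[baseIndex];
    double bestDistance = Distance(best, h);

    // Steps 2-4: walk to the right until the x-gap alone exceeds the best distance.
    for (int i = baseIndex + 1; i < sorted.Length; i++)
    {
      if (sorted[i].X - h.X > bestDistance) break;
      double d = Distance(sorted[i], h);
      if (d < bestDistance) { bestDistance = d; best = sorted[i]; }
    }

    // Steps 5-7: then walk to the left with the mirrored stop condition.
    for (int i = baseIndex - 1; i >= 0; i--)
    {
      if (h.X - sorted[i].X > bestDistance) break;
      double d = Distance(sorted[i], h);
      if (d < bestDistance) { bestDistance = d; best = sorted[i]; }
    }

    return best;
  }
}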

However there are still deep problems with the core algorithm and if we want to address them we’ll have to drastically rethink our approach. The main issue is that the sorting only happens along a single axis. With well balanced, two-dimensional points that isn’t too big a deal, but if you start dealing with three or more dimensions it becomes increasingly less efficient.

Up next, spatial buckets and how they can be used to generate a notion of locality.

What’s the point? [2]

March 8, 2013

In the previous post we discussed some generic optimization strategies that might be applied to software to speed things up. Now we'll look instead at a very well-known problem to see how being a smart programmer is still better than having a faster computer. The problem we'll tackle is known as Closest Point Search, and it is interesting because it is both very simple to understand and very simple to solve with a naive algorithm.

Imagine a large collection of two-dimensional points (C) strewn across a flat surface. For the time being we’ll assume that the furthest distance between any two points in the collection is finite and that the average distance from any point to its neighbours is roughly the same. Such a point collection is called balanced. By contrast, an unbalanced collection would have very pronounced clusters where points group together. The further apart these clusters are the more unbalanced the collection is.

Balanced vs. Unbalanced

The question we want to answer is: "Which point (p) in this collection is closest to a location (h)?" A naive algorithm which would answer this question is the brute-force implementation:

  1. For each point p in C, compute the distance to h.
  2. If this distance is less than the previous smallest distance, remember this new distance as the smallest distance and remember p as the best result so far.
  3. When we’re done iterating over all points, we’ll have remembered the point nearest to h.
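In code, the brute-force version really is as short as it sounds. Something along these lines, with Point2d being nothing more than a throw-away container for two coordinates:

using System;

struct Point2d { public double X, Y; }

static class ClosestPoint
{
  public static Point2d BruteForceSearch(Point2d[] C, Point2d h)
  {
    Point2d best = C[0];
    double bestDistance = double.MaxValue;

    // Step 1: measure the distance from every point in C to h.
    foreach (Point2d p in C)
    {
      double dx = p.X - h.X;
      double dy = p.Y - h.Y;
      double d = Math.Sqrt(dx * dx + dy * dy);

      // Step 2: remember the smallest distance and the best point so far.
      if (d < bestDistance)
      {
        bestDistance = d;
        best = p;
      }
    }

    // Step 3: whatever we remembered is the point nearest to h.
    return best;
  }
}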

The problem with this algorithm is that it has to iterate over all points in C. Which means that when there’s twice as many points, it’ll take twice as long. Ten times as many points, it’ll take ten times as long. It would be nice if we could somehow find a way to omit points from the search. If we can reject out of hand —say— half the points our algorithm will be twice as fast.

First, however, some bad news. If we only ask the question "which is the point closest to h?" once, then there's absolutely nothing we can do*. We can only start to optimize the search if we're certain that there will be multiple closest point searches performed on the same —or at least a very similar— collection.

So let’s consider an optimization to our brute-force search algorithm. We’ll add a pre-process step and also an escape clause:

  1. Sort the points in C by ascending x-coordinate (i.e. from left to right).
  2. For each point p in C, compute the distance to h.
  3. If this distance is less than the previous smallest distance, remember this new distance as the smallest distance and remember p as the best result so far.
  4. If the x-coordinate of p is higher than the x-coordinate of h + the current smallest distance then stop iterating.
  5. When we're done iterating, we'll have remembered the point nearest to h.

The trick here is the sorting. Since we know we're now iterating from left to right, we also know that every new point we test will have an x-coordinate that is equal to or higher than the point we just tested. So once we find that the distance between this x-coordinate and the x-coordinate of h is larger than the smallest distance we've found so far**, we know for a fact that all remaining points cannot possibly yield a better answer than the one we already have. We can thus reject those from the search.
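As a sketch —a second method for the same little ClosestPoint class from the snippet above— the only changes compared to the brute-force version are the sort and the early exit:

public static Point2d SortedBruteForceSearch(Point2d[] C, Point2d h)
{
  // Step 1: the pre-process; sort the points by ascending x-coordinate.
  // (Do this once, outside the search, if the same collection is queried repeatedly.)
  Array.Sort(C, (a, b) => a.X.CompareTo(b.X));

  Point2d best = C[0];
  double bestDistance = double.MaxValue;

  foreach (Point2d p in C)
  {
    // Step 4: once the x-gap alone exceeds the smallest distance found so far,
    // no point further to the right can possibly do better, so stop early.
    if (p.X - h.X > bestDistance)
      break;

    double dx = p.X - h.X;
    double dy = p.Y - h.Y;
    double d = Math.Sqrt(dx * dx + dy * dy);
    if (d < bestDistance) { bestDistance = d; best = p; }
  }

  return best;
}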

BruteForcePlusSorting

The image above provides a snapshot of the search algorithm just prior to step 4. All the black points have already been tested and almost every new point turned out to be closer than the previous one. We've just tested the point that was nearest to × and the green circle represents the remaining area in which even closer points may be found. The right-most edge of the circle defines the location of the rejection boundary. The next point to be tested is just beyond this boundary and thus it and any subsequent points still left in the collection cannot possibly yield a better result than the one we already have.

This approach allows us to omit certain points and will thus be faster than the brute-force search. There is however an initial penalty since we have to sort the points first. But although this algorithm performs better, it still has many weaknesses. If the point collection has little variation along the x-direction, for example, then the benefit of sorting along the x-direction is minimal. Also, when h is far outside the bounding box of C, the shortest distance will always be quite large and the Rejection Boundary might never actually exclude any points, nullifying the optimization. Finally, if h is somewhere near the right edge of C, then the Rejection Boundary will be well outside of C for most of the time. One could of course choose to sort C along the y-direction or to iterate backwards instead of forwards depending on the relative location of h with respect to C; this will definitely alleviate some of the drawbacks at the expense of more complicated code. There is however another trick that can be used to significantly improve the performance of the sorted brute-force search. But for that we need to understand how binary searching works…

* Short of multithreading the Brute-force search. We could split the collection into N sub-collections (one for each available core), find the nearest point in each sub-collection and then find the best result amongst the N partial results.

** We are ignoring the contribution of the y-coordinate, which can only make the distance larger, never smaller. It is thus safe to ignore.