So, about KD-trees
A couple of days ago I have updated my KD-tree library adding a query for K-nearest neighbors. This had been raised as an issue in the repository for a while back, and since I had noticed the issue and added a development branch for the feature I had been conducting low-intensity warfare with a bug that meant that while most of the K-nearest neighbours returned by the query were correct, some were not (so, a partial success, which is the worst kind of success).
The problem was actually pretty simple, as these things usually tend to go but it still went unnoticed for a while before I was able to actually find it.
KD-trees derive their efficiency from placing a set of points in a tree structure, where the nearest-neighbor query is able to stop the search down a particular branch if it notices that no improvement in the best distance estimate can be attained further down the branch.
This is done by guaranteeing that at every node at depth d
of the tree the d
-th coordinate of all the points in the left branch of that node is smaller than that of the node, and conversely larger for those points on the right branch of the node.
At query time this is used by checking if the distance along the d
-th axis is larger than the current best.
To turn this into a K-nearest neighbours query simply means that now, during the query we keep track of a set of k
answers, and rather than cutting off when the best distance cannot be improved, we cut off when the k
-th best distance cannot be improved.
The catch is that this led to still stopping the search too soon, as if promising branches are encountered too soon in the search procedure will mean that we exit too early for that branch, and so we also must make sure that we have already filled our answer buffer with k
nearest neighbours.
This wasn't exactly a revelation, but it felt interesting enough to just drop a post. So the repository now has a version 2 without ever having a version 1 because I never expected to have versions in the first place, but changing the API (even though it was only an extension) seems like a good-enough reason to introduce a version number (also partly because there were pretty dramatic refactorings).