Extracting Similar Data: Using K-D Trees to Efficiently Locate Nearest Neighbors

Finding points of similar location using k-dimensional trees in large datasets — in this case geospatial (latlon) centroids

Daniel Ellis Research
4 min readMar 22, 2022

--

Photo by Kunj Parekh on Unsplash

Introduction

Finding data of similar magnitude has many applications in the data science community. Although similarity metrics or vector clustering may in some cases be used, we look at near-uniformly spaced values where doing so can take a long time to compute.

Why use K-Dimensional Trees?

Traditionally if we want to find which point is closest to another, we would calculate the distance between that point and every other, and then select the smallest one — this is the brute force method.

Complexity of the Brute Force Algorithm

As an example, we use two datasets of size len(n) = 34378 and len(m) = 175434 . The aim is to find the nearest neighbor from n which corresponds to each element in m. Ignoring any actual distance calculations and sorting, the minimum complexity of this task is O(n*m).

This means that for each value of m we must calculate the distance to each point in n as the bare minimum — giving us 6031070052 calculations. We then the sorting which at worst can also be O(n*n) and add any other calculation delays.

K-D trees

K dimensional trees on the other hand are constructed by iteratively splitting the Xdimensional hyperplane into sets of two (around the median), and cycling through each dimension in order.

2d decomposition of the branches of a KDtree
2D plane decomposition. Source: https://towardsdatascience.com/understanding-k-dimensional-trees-1cdbf6075f22

The process allows us to partition our data space and map it using a binary (decision) tree structure:

A KDtree
Tree structure. Source: https://towardsdatascience.com/understanding-k-dimensional-trees-1cdbf6075f22

This means that we are now able to navigate the data space without having to visit every point in the comparison…

--

--

Daniel Ellis Research

Research Software Engineer specialising in High-Performance Computing and Data Visualisation. — PhD in Atmospheric Chemistry and Masters in Theoretical Physics.